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We study the algorithmic optimization and performance tuning of the Lattice QCD clover-fermion 
solver for the K computer. We implement the Luscher's SAP preconditioner with sub-blocking 
in which the lattice block in a node is further divided to several sub-blocks to extract enough 
parallelism for the 8-core CPU SPARC64™ VHIfx of the K computer. To achieve a better 
convergence property we use the symmetric successive over-relaxation (SSOR) iteration with 
locally-lexicographical ordering for the sub-blocks in obtaining the block inverse. The SAP pre- 
conditioner is included in the single precision BiCGStab solver of the nested BiCGStab solver. 
The single precision part of the computational kernel are solely written with the SIMD oriented 
intrinsics to achieve the best performance of the SPARC64™ VHIfx on the K computer. We 
benchmark the single precision BiCGStab solver on the three lattice sizes: 12 3 x24, 24 3 x48 and 
48 3 x 96, with fixing the local lattice size in a node at 6 3 x 12. We observe an ideal weak-scaling 
performance from 16 nodes to 4096 nodes. The performance of a computational kernel exceeds 
50% efficiency, and the single precision BiCGstab has ~26% susutained efficiency. 
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1. Lattice QCD on the K computer 

The K computer has been developed by RIKEN and Fujitsu since 2006 as the national leader- 
ship computer of Japan to promote science and technology |jlj|. The construction has been finished 
on July 2012 and the system is now provided to the strategic programs of the High Performance 
Infrastructure (HPCI) and to the research users those who apply the computer resource through the 
HPCI [|0. The strategic programs consist of five fields: (i) Predictable life science, healthcare and 
drug discovery foundation, (ii) New Materials and Energy Creation, (iii) Projection of Planet Earth 
Variations for Mitigating Natural Disasters, (iv) Next-generation manufacturing technology, and (v) 
The origin of matter and the universe. Lattice QCD is contained in the fifth strategic program [J3j] . 

The K computer consists of over 80,000 computational nodes connected by the so called 
"Tofu" network. The Tofu network topology is six dimensional topology with 3D-mesh times 
3D-torus shape, with which robustness against the node failure and high node flexibility and avail- 
ability are ensured. Each node has a sing le CPU chip called "SPARC64™ VUIfx ". Equipping 
8 cores with SIMD enabled 256 registers and 6MB shared L2 cache, the CPU can achieve a high 
efficiency for scientific applications (for the detailed system structure see [Q]). Lattice QCD is one 
of the suitable applications for this kind of massively parallel system architecture. 

In this paper we present the algorithmic and performance tuning of the 0(a) improved Wilson 
quark solver for the K computer. We focus on the solver algorithm preconditioned by the domain- 
decomposed Schwartz alternating procedure (SAP) with mixed precision In the next section we 
briefly introduce the SAP preconditioner and the nested BiCGStab algorithm The optimization 
of the algorithm and tuning methods for the computational kernels of the SAP are presented in 
section ||. We give the performance benchmark results in section ^| and summarize the paper in the 
last section. 

2. Nested BiCGStab with LUscher's SAP preconditioner 

Our target problem is solving the following linear equation: 

Dx = b, (2.1) 
where D is the clover term preconditioned O (a) -improved Wilson-Dirac operator: 
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D a ^(n,m) = S«' b S a ,p8(n,m)-KFZf y (n) £ (1 - y^U^n))^ 8(n + fi,m) 



(2.2) 



Where F(n) is the inverse clover term (1 — (cswK"/2)a MV F MV («)) _1 , (n,m) are lattice site, a, b 
color, a, j3 are spin indexes. 

Luscher has introduced the Schwartz alternating procedure (SAP) to further preconditioning 
the Eq. (IO) [S]. By dividing the whole lattice into two colored blocks in checkerboard manner, 



Eq. ( |2.1[ ) can be rewritten in the following 2x2 block form: 

(2.3) 



Dee D eo \ I x E 
Doe Dqo I \xo 
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where Dee (Doo) is a block restricted operator in even-domain (odd-domain), while Deo (Doe) 
contains hopping operations from odd-domain to even-domain (and vice versa). The SAP practi- 
tioner Msap is introduced as 

M SAP = K £ (l-DK)j, with K= . , (2-4) 

i= o \ — a oo u oeAeo a OO J 

where Aee (Aoo) can be any approximation for (Dee) 1 ((Doo) 1 )- When Aee and Aoo are exact, 
DK becomes block triangular and is expected to be well preconditioned. In the case \DK\ < 1, Msap 
converges to D 1 when A^sap — > °°- 

The nested BiCGStab solver is designed to be flexible against the preconditioner changing 
iteration by iteration [|]] using an inner-outer strategy. The outer BiCGStab contains the inner 
BiCGStab solver as the flexible preconditioner. The flexibility ensures the double precision accu- 
racy of the solution vector even if we use the single precision for the inner BiCGStab solver [^]. 
We apply the SAP preconditioner Msap to the inner single precision BiCGStab solver. The use of 
single precision has a merit as it requires less system resources than double precision. We target 
the single precision part of the solver as the tuning part for the K computer. In the following section 
we focus on the tuning of the single precision DK of Msap- 

3. Performance tuning for the K computer 

Inverse of block operator and OpenMP threading The approximate inverse of the block op- 
erator Aee (Aoo) are the important part of the SAP. The exactness is not required for the SAP, 
however, the better approximation with less computational cost is preferred for A E e (Aoo)- The 
even-odd site preconditioning has been used in [||] and the SSOR preconditioning has been used 
in []|]. The latter has a better performance than the former at the same computational cost. The 
SSOR preconditioning is derived by decomposing the original operator into the sum of an upper 
and a lower triangular matrices. The preconditioning is achieved by solving the upper and lower 
triangular parts through forward and backward substitutions. The decomposition and the efficiency 
of the SSOR depend on the site ordering. It is observed that the SSOR with natural ordering have a 
better performance [gj. However the SSOR with natural ordering is not suitable for the multi-core 
CPU architecture because the natural ordering has a global data recurrence pattern and less paral- 
lelism in the forward and backward substitutions. To extract 8 core parallelism for the K computer, 
we further divide the block into 16 sub-blocks via the locally-lexicographical ordering (//-ordering) 
described in 

Figure |l| shows an example of 6 4 lattice in a node. The block in a node is divided into 2 4 
sub-blocks. Each single core contains two sub-blocks (two sub-blocks with size 3 4 adjacent in 
temporal direction). The numbering on the sites is the ordering according to the //-ordering. The 
arrows on the links represent the data recurrence direction. The most of the recurrence are limited 
in each sub-block and there are little data reference on the surface of sub-blocks. Based on this 
ordering we wrote the forward and backward solver to construct Aee (Aoo) with the SSOR. The 
parallelization in a node is achieved by explicit OpenMP threading. 8 threads are invoked and two 
sub-blocks are assigned to each OpenMP thread. The spatial division is mapped to OpenMP 8 
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Spatial-Temporal dir. 



threads, and the temporal division is dedicated to the loop unrolling in a single thread. To resolve 
the recurrence dependency among threads, we carefully insert explicit omp barrier's at the sites 
that require data on other threads. 

The sub-block size affects the s P at 1 a 1 d 1 rect ' on 

performance of the SAP. It has been 
reported that the size of 2 4 for block 
still has a gain against the even-odd 
site ordering on a larger size lat- 
tice [^J. Although we apply the 11- 
ordering to a small lattice (corre- 
sponds to a block in a node), we 
observed that the use of the SSOR 
with sub-blocking via //-ordering 
for Aee still has a gain against the 
even-odd site ordering. 




Core* J 



Figure 1: Site ordering in a CPU. This is an example with a 6 4 
block in a node. The numbering of sites is limited in the 2-D 
plane for simplicity. 



SIMDzation The SPARC64™ VHIfx CPU has 256 double precision (64bit) FP registers per 
core. Each core can issue two independent fused multiply add (FMAD) on paired registers (SIMD) 
in a cycle resulting in 8 floating operations per cycle. To fully make use of these core infrastructure 
we have to extract enough independent data-computation stream in the kernel code Dee and Aee- 
The SIMD FMAD operation can be explicitly invoked using intrinsic functions defined in the 
C/C++ language provided by Fujitsu. Although our entire program codes were developed in the 
Fortran90 at first, we totally write the single precision inner BiCGStab solver using the SIMD 
intrinsic functions in the C language. The details of the intrinsics are not available publicly, while 
these have one-to-one correspondence to the mnemonic of the SPARC64™ VHIfx [10]. The 



linkage between the Fortran90 and the C codes is realised through the ISO_C_BINDING feature 
of the Fortran 2003 which is partly included in the Fujitsu Fortran. We use double precision intrinsic 
functions for the SIMD computation because the single precision FUOPS performance is identical 
to that in double precision in this CPU architecture. The data are converted from single precision 
to double precision (and vice verse) at the data loading (storing) from (to) memory. We employ the 
SU (3) reconstruction method in the kernel code where the link variables are stored in compressed 
form (keeps only two-columns of the SU (3) matrix) and the third column is reconstructed using 
the unitarity condition on the fly. 

To further improve the kernel performance 
we unroll the temporal direction loop, which is 
the innermost site loop, by three times for Dee- 
For the forward/backward solver in Aee the two 
independent sub-blocking in temporal direction is 
used as unrolling. This is based on the technique 
of the register blocking which hide the data load 
latency. The unrolling also contains branch elimination by condition merging. This loop unrolling 
with huge loop body is possible thanks to the many registers of the SPARC64™ VHIfx . 

In table [T] we summarize the total flop and load/store count per site for the hopping matrix of D. 



St/ (3) 
reconst. 


Flop 
count 


Load/Store 
count (real) 


Real/Flop 


Yes 


2232 


372 


0.1667 


No 


1896 


420 


0.2215 



Table 1: Flop count and load/store count for the 
clover hopping matrix per site. 
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These numbers are estimated for bulk sites in a block (not for the surface sites). We use the Dirac 
representation for 7^ (74 is diagonal) in the spin projection, and use the chiral representation for the 
inverse clover term F(n). When we multiply F(n), it is converted to the Dirac representation on 
the fly. The required byte/flop is 0.667 with the SU (3) reconstruction method in single precision. 
The theoretical system byte/flop is 64 [GByte/s]/128 [GFlops] = 0.5. Thus our computation is still 
limited by the memory bandwidth. The maximum theoretical performance is estimated to be ~ 96 
GFlops from the system memory bandwidth of 64 GByte/s. Excluding the redundant flop count 
caused by the SU (3) reconstruction, the effective performance is ~ 82 GFlops, which is still better 
than that without the SU (3) reconstruction (~ 72 GFlops). This performance number could be 
reduced or enhanced by the cache system, site loop control, conditional branch for site location, 
thread controlling etc. We test and benchmark the Dee and Aee kernels. 



Algorithm 1 w = DKv computation with com- 
munication hiding, fg and fo are working vec- 



Communication hiding The internode com- 
munication in the SAP kernel DK is arranged 
in the Deo an d Doe kernels. The structure 
of Deo (F>oe) is basically organized as fol- 
lows: (i) spin projection, link U^(m) multi- 
plication and data packing, (ii) sending data, 
(iii) receiving data, (iv) link U^(n) multipli- 
cation, spin reconstruction and accumulation. 
Other computation is possible during the step 
(ii) and (iii). 

We organize the SAP kernel DK to hide 
the communication time of Deo (F>oe) behind 
the computation of Dee (F>oo) as shown in 
Alg. |TJ To hide the communication we em- 
ploy the non-blocking MPIs: MPIJsend, MPIJrecv, and MPI_Wait. The steps (i)-(ii) are done 
at the lines 2 and 7, and the steps (iii)-(iv) are at the lines 4 and 10. We benchmark the commu- 
nication performance in the weak-scaling test by comparing the performance of Dee and DMsap, 
where the latter contains the internode communication while the former does not. 



tors. 


1 


x E = Aee\'e 




2 


MPIJsend and MPIJrecv for wo 


= Doexe ■ 


3 


w E = D EE x E 




4 


MPIJVait for w = D 0E x E . 




5 


fo = vo~ w 




6 


xo =A ofo 




7 


MPIJsend and MPIJrecv for f E 


= Deoxo- 


8 


fo = Doo*o 




9 


w = w +fo 




10 


MPLWait for f E = D EO x - 




11 


w E = w E + f E 





4. Results 



We benchmark the single precision BiCGStab with the SAP preconditioner on the K computer. 
The lattice sizes benchmarked are 12 3 x 24, 24 3 x 48, and 48 3 x 96. The block size of the SAP is 
kept fixed at 6 4 and the local lattice size in a node is 6 3 x 12. The sub-block size for the SSOR is 
thus 3 4 . The number of nodes used for the benchmark is 16, 256 and 4096 nodes. The performance 
is measured using the profiler provided by the Fujitsu's compiler system. 

Figure ^| shows the SIMD rate in the instruction executed in benchmarking runs. We achieve 
90% SIMD rate for the kernel Dee and Aee by explicitly using the SIMD intrinsic functions. 
At the solver level it reduces to 85% as it contains internode communication. Figs || and ^| rep- 
resents the weak-scaling property of the flops performance. The efficiency (Fig. |3|) is almost at 
constant for the kernels resulting an ideal weak-scaling (Fig. ^ from 16 nodes to 4096 nodes. 
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Figure 3: Same as Fig. but for the performance 
efficiency in a node. 
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Figure 2: Weak-scaling of the SIMD rate of the 
single precision kernels in a node. 

The efficiency reaches over 50% for the Dee 
kernel, while the Aee kernel has lower 35%. 

There are two reasons for the lower per- 
formance for Aee- One is the load imbalance 
of the computation among the threads in A E e- 
For example, the number of red arrows is more 
than that of purple in the left panel of figure [j] 
This means that the core #0 has much compu- 
tation than that of the core #3. The second is 
the less computational density in the loop body 
compared to that of Dee- As mentioned in 
previous section the loop of Dee is unrolled 
by three times, while Aee is by two with sub- 
blocking. The forward/backward substitution 
of Aee only has one sided hopping operation for bulk sites. This also reduces the computa- 
tional density in the loop body. The theoretical peak efficiency for Dee is estimated be 75%(=96 
[GFlops]/128 [GFlops]), while the observed 50% efficiency does not reach the theoretical peak 
efficiency. A reason for this could be the conditional branch to distinguish the sites in the bulk or 
on the surface of the block. We still need a further investigation on the gap between the theoretical 
peak efficiency and the observed efficiency. 

The performance efficiency of the single precision inner solver is at ~26% in all. We observe 
an ideal weak-scaling property as shown in Fig. |j. Note that the measured performance presented 
in the figures contains redundant floating operations coming from the SU (3) reconstruction and the 
use of FMAD for the spin projection. We have to roughly multiply 0.8 on the Flops numbers in 
Fig. ^| to exclude the redundant operations and to obtain the effective performance. 



10 3 3 3 

1 2 J x24 24 J x48 48 3 x96 

Lattice size 

Figure 4: Weak-scaling of the total flops perfor- 
mance. 



5. Summary 

We have benchmarked the single-precision BiCGStab solver for the 0(a) -improved Wilson 
fermion on the K computer. The solver code has been successfully optimized and tuned for the 
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system. We have applied the SSOR with the //-ordered sub-blocking for the approximate block in- 
verse of the SAP preconditioner to enhance the parallelism for the multi-core architecture. Thanks 
to the science specific architecture of th K computer we could achieve the ideal weak-scaling per- 
formance and ~ 26% efficiency for the single-precision BiCGStab solver. It partly remains unclear 
the origin of the gap between the theoretical system performance and the measured performance. 
We expect a further improvement on the internode communication by using the "Tofu" specific 
optimization which is not covered in this study. 
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